Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
نویسندگان
چکیده
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel Entity Resolution algorithm that introduces a data-driven blocking and record linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-theart results. The proposed algorithm can be implemented simply on modern parallel databases, which allows it to be deployed with relative ease in large industrial applications.
منابع مشابه
Towards a Scalable and Robust Entity Resolution -Approximate Blocking with Semantic Constraints
Entity resolution, or record linkage, is the process that identifies data records over one or more datasets which refer to the same real world entity. To deal with large datasets, many real-life applications require scalable and high-quality entity resolution techniques. Blocking techniques can help to scale-up the entity resolution process. Locality sensitive hashing (LSH) is an approximate bl...
متن کاملOn Entity Resolution for Probabilistic Data
Entity resolution (ER) is the problem of identifying duplicate tuples, which are the tuples that represent the same real-world entity. There are many real-life applications in which the ER problem arises. These applications range from news aggregation websites, identifying the news that cover the same story, in order to avoid presenting one story several times to the user, to the integration of...
متن کاملFactorized Databases: Past and Future Past
In this talk I will overview the FDB project at Oxford on succinct, lossless representations of relational data that I call factorized databases. I will first present a characterization of the succinctness of results to conjunctive queries and how factorizations can speed up query processing.I will then comment on how this succinctness characterization relates to seemingly disparate results on:...
متن کاملThe Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
متن کاملEntity Resolution Acceleration using Micron’s Automata Processor
Entity Resolution (ER), the process of finding identical entities across different databases, is critical to many information integration applications. As sizes of databases explode in the big-data era, it becomes computationally expensive to recognize identical entities for all possible records with variations allowed. Profiling results show that approximate matching is the primary bottleneck....
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1712.09691 شماره
صفحات -
تاریخ انتشار 2017